Yule Duan, Xuao Wu, Haoyu Deng, Liang-Jian Deng
University of Electronic Science and Technology of China, China
Currently, machine learning-based methods for remote sensing pansharpening have progressed rapidly. However, existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces, thereby limiting the effectiveness of the methods and resulting in redundant learning parameters.
In this paper, we introduce a so-called content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. Specifically, CANConv employs adaptive convolution, ensuring spatial adaptability, and incorporates non-local self-similarity through the similarity relationship partition (SRP) and the partitionwise adaptive convolution (PWAC) sub-modules.
Furthermore, we also propose a corresponding network architecture, called CANNet, which mainly utilizes the multi-scale self-similarity. Extensive experiments demonstrate the superior performance of CANConv, compared with recent promising fusion methods. Besides, we substantiate the method’s effectiveness through visualization, ablation experiments, and comparison with existing methods on multiple test sets. The source code is publicly available at https://github.com/duanyll/CANConv
Images captured by remote sensing satellites: high-resolution, single-band panchromatic (PAN) images and Low-Resolution Multi-Spectral (LRMS) images.
We hope to obtain High-Resolution Multi-Spectral (HRMS) images.
\mathrm{PAN} + \mathrm{UpSample}(\mathrm{LRMS}) = \mathrm{HRMS}
[1] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.
[2] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik G. Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11158–11167, 2019.
[3] Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, and Chen Change Loy. Cross-scale internal graph neural network for image super-resolution. In Advances in Neural Information Processing Systems, 2020.
Remote sensing images consist of regions (segments) sharing the same semantics, and these regions can span very wide areas.
Previous works use kNN to capture non-local features in an image, but k nearest neighbors are far from enough in practice, and increasing k introduces heavy, redundant computational overhead. This paper therefore turns to a clustering approach instead: pixels sharing similar features are grouped into sets.
Input feature map: X \in \mathbb{R}^{\overbrace{H}^{\text{height}} \times \overbrace{W}^{\text{width}} \times \overbrace{C_\text{in}}^{\text{input channels}}}
For pixel X_{xy} (a vector), apply spatial mean pooling in a k \times k neighborhood to obtain \boldsymbol{f}_{xy}:
\boldsymbol{f}_{xy} = \frac{1}{k^2}\sum_{i=-\lfloor\frac{k}{2}\rfloor}^{\lfloor\frac{k}{2}\rfloor}\sum_{j=-\lfloor\frac{k}{2}\rfloor}^{\lfloor\frac{k}{2}\rfloor}X_{x+i,\,y+j}
Apply k-Means to obtain cluster index matrix \boldsymbol{I}, where \boldsymbol{I}_{xy} denotes the index number of the cluster to which pixel X_{xy} belongs.
Construct SRP:
S_i=\{(x,y)|\boldsymbol{I}_{xy}=i\}
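As a concrete illustration, here is a minimal NumPy sketch of the SRP step under the definitions above; the cluster count K, the plain Lloyd's k-means loop, and the reflect padding are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

def similarity_relationship_partition(X, k=3, K=32, iters=10, seed=0):
    """Cluster the pixels of X (H x W x C_in) by their k x k mean-pooled features."""
    X = np.asarray(X, dtype=float)
    H, W, C = X.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")

    # f[x, y] = mean of the k x k neighbourhood centred at (x, y)
    f = np.zeros_like(X)
    for i in range(k):
        for j in range(k):
            f += Xp[i:i + H, j:j + W]
    f /= k * k

    # plain Lloyd's k-means on the pooled per-pixel features
    feats = f.reshape(-1, C)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=K, replace=False)]
    for _ in range(iters):
        dist = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        for c in range(K):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)

    I = labels.reshape(H, W)                              # cluster index matrix
    partitions = [np.argwhere(I == c) for c in range(K)]  # S_i as (x, y) coordinates
    return I, partitions
```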
The core problem is how to construct an adaptive convolution kernel for each cluster.
Firstly, find the centroid of each cluster:
\boldsymbol{c}_i = \frac{1}{|S_i|}\sum_{(x,y)\in S_i}\boldsymbol{p}_{xy},
where \boldsymbol{p}_{xy} \in \mathbb{R}^{k^2C_\text{in}} is the unfolded \mathbb{R}^{k \times k \times C_\text{in}} input patch, i.e., the elements of the k \times k area centered at (x, y) (with C_\text{in} channels) rearranged into a vector \boldsymbol{p}_{xy}.
In the following parts, the centroid \boldsymbol{c}_i will be used to represent S_i.
Deal with outlier pixels: if |S_i|<\eta \cdot HW (\eta is a threshold ratio), replace the centroid with the global mean \boldsymbol{c}_i=\frac{1}{HW}\sum_{(x,y)\in U} \boldsymbol{p}_{xy}, where U is the set of all pixel positions.
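A matching sketch of the centroid computation with the outlier fallback; the unfold layout, reflect padding, and the default threshold ratio eta are again assumptions for illustration.

```python
import numpy as np

def cluster_centroids(X, I, k=3, eta=0.01):
    """Centroid c_i of the unfolded k*k*C_in patches for each cluster in index map I."""
    X = np.asarray(X, dtype=float)
    H, W, C = X.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")

    # patches[x, y] is the k x k neighbourhood around (x, y) flattened to k^2 * C_in
    patches = np.stack(
        [Xp[i:i + H, j:j + W] for i in range(k) for j in range(k)], axis=2
    ).reshape(H, W, k * k * C)

    global_mean = patches.reshape(-1, k * k * C).mean(axis=0)
    centroids = {}
    for c in np.unique(I):
        mask = I == c
        if mask.sum() < eta * H * W:          # too few pixels: fall back to the global mean
            centroids[c] = global_mean
        else:
            centroids[c] = patches[mask].mean(axis=0)
    return centroids
```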
A global kernel parameter \boldsymbol{W} \in \mathbb{R}^{C_\text{in} \times k^2 \times C_\text{out}}.
Three perceptrons map the centroid \boldsymbol{c}_i to weight vectors \boldsymbol{w}_\text{cin} \in \mathbb{R}^{C_\text{in}}, \boldsymbol{w}_\text{s} \in \mathbb{R}^{k^2}, \boldsymbol{w}_\text{cout} \in \mathbb{R}^{C_\text{out}}.
Adaptive kernel: f_\text{k}(\boldsymbol{c}_i) = (\boldsymbol{w}_\text{cin} ⊛ \boldsymbol{w}_\text{s} ⊛ \boldsymbol{w}_\text{cout}) ⊙ \boldsymbol{W}.
⊛ denotes the Kronecker product, defined by: A_{m \times n}⊛B_{p \times q}=\begin{bmatrix} a_{11}B&a_{12}B&\cdots&a_{1n}B\\ a_{21}B&a_{22}B&\cdots&a_{2n}B\\ \vdots&\vdots&\ddots&\vdots\\ a_{m1}B&a_{m2}B&\cdots&a_{mn}B\\ \end{bmatrix}.
⊙ refers to element-wise product.
Convolution output: \boldsymbol{Y}_{xy} = \boldsymbol{p}_{xy} \otimes f_\text{k}(\boldsymbol{c}_{\boldsymbol{I}_{xy}}) + f_\text{b}(\boldsymbol{c}_{\boldsymbol{I}_{xy}}), where \otimes multiplies the unfolded patch \boldsymbol{p}_{xy} with the (reshaped) adaptive kernel, and f_\text{b}(\cdot) generates an adaptive bias from the centroid in the same manner.
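A sketch of the PWAC kernel generation and the per-pixel output under the notation above; mlp_cin, mlp_s, mlp_cout, and mlp_bias are placeholders for whatever layers produce the modulation vectors and bias, and a consistent flattening order between p_xy and W is assumed.

```python
import numpy as np

def adaptive_kernel(c_i, W, mlp_cin, mlp_s, mlp_cout):
    """f_k(c_i): modulate the shared kernel W (C_in, k^2, C_out) with the centroid."""
    w_cin = mlp_cin(c_i)    # (C_in,)
    w_s = mlp_s(c_i)        # (k^2,)
    w_cout = mlp_cout(c_i)  # (C_out,)
    # Kronecker product of the three vectors yields one scalar per entry of W
    modulation = np.kron(np.kron(w_cin, w_s), w_cout).reshape(W.shape)
    return W * modulation   # element-wise product, same shape as W

def pwac_pixel(p_xy, c_i, W, mlps, mlp_bias):
    """Y_xy: unfolded patch applied to the cluster's adaptive kernel, plus an adaptive bias."""
    K = adaptive_kernel(c_i, W, *mlps)   # (C_in, k^2, C_out)
    K = K.reshape(-1, K.shape[-1])       # (C_in * k^2, C_out)
    # assumes p_xy is flattened in the same (C_in, k^2) order as W
    return p_xy @ K + mlp_bias(c_i)      # (C_out,)
```

The perceptrons could be as simple as `lambda c: 1.0 / (1.0 + np.exp(-(A @ c + b)))` with learned `A`, `b`; the paper's exact layers may differ.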
Ablation studies on the proposed CANNet:
Ablate SRP: treat the entire feature map as a single cluster, or treat each pixel as its own cluster.
Ablate PWAC: use a Multi-Layer Perceptron instead.
BTW: Some ideas in this paper may be helpful in paddy classification.
This part lists the definitions of several metrics used in this paper. For more information, see also Multispectral and Panchromatic Data Fusion Assessment Without Reference [4].
[4] Luciano Alparone, Bruno Aiazzi, Stefano Baronti, Andrea Garzelli, Filippo Nencini, and Massimo Selva. Multispectral and panchromatic data fusion assessment without reference. Photogrammetric Engineering & Remote Sensing, 74(2):193–200, 2008. doi:10.14358/PERS.74.2.193.
\mathrm{SAM} = \arccos \frac{\vec{x}^\mathsf{T}\vec{y}}{\|\vec{x}\|\|\vec{y}\|},
where \vec x is the test spectrum and \vec y is the reference spectrum. The smaller \mathrm{SAM} is, the more likely \vec x and \vec y correspond to the same type of object.
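A straightforward NumPy sketch of SAM, averaged over all pixels and reported in degrees (conventions vary between radians and degrees).

```python
import numpy as np

def sam(test, ref, eps=1e-12):
    """Mean spectral angle (degrees) between test and ref images of shape (H, W, L)."""
    x = test.reshape(-1, test.shape[-1]).astype(float)
    y = ref.reshape(-1, ref.shape[-1]).astype(float)
    cos = (x * y).sum(1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
```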
\mathrm{ERGAS} is a French acronym (Erreur Relative Globale Adimensionnelle de Synthèse), meaning relative dimensionless global error in synthesis.
\mathrm{ERGAS} = 100\;\frac{d_\text{PAN}}{d_\text{LRMS}}\sqrt{\frac{1}{L}\sum_{l=1}^L\left(\frac{\text{RMSE}_l}{\mu_l}\right)^2},
where d_\text{PAN}/d_\text{LRMS} is the ratio of the PAN pixel size to the LRMS pixel size (e.g., 1/4), L is the number of bands [5], \text{RMSE}_l is the root mean square error of band l, and \mu_l is the mean of the reference band l.
[5] band, i.e., channel
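A sketch of ERGAS under the formula above; `ratio` stands for d_PAN/d_LRMS, e.g. 1/4 for 4x pansharpening.

```python
import numpy as np

def ergas(fused, ref, ratio=1 / 4):
    """ERGAS between fused and reference images of shape (H, W, L)."""
    fused = fused.astype(float)
    ref = ref.astype(float)
    L = ref.shape[-1]
    terms = []
    for l in range(L):
        rmse = np.sqrt(np.mean((fused[..., l] - ref[..., l]) ** 2))
        terms.append((rmse / ref[..., l].mean()) ** 2)
    return 100.0 * ratio * np.sqrt(np.mean(terms))
```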
\mathrm{QI} means Quality Index.
For a four-band pansharpened multi-spectral (MS) image, the quality index \mathrm{Q4} generalizes the \mathrm{QI} [6], which can be applied only to monochrome [7] images, to four-band images.
The calculation of \mathrm{QI} can be generalized to n-band images.
[6] Zhou Wang and Alan C. Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84, 2002.
[7] black-and-white
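A global (whole-image) sketch of the Wang–Bovik QI [6]; the reference computes Q over sliding windows and averages the results, which this simplification omits.

```python
import numpy as np

def qi(x, y, eps=1e-12):
    """Universal image quality index Q between two single-band images."""
    x = x.astype(float).ravel()
    y = y.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2) + eps)
```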
D_\lambda = \sqrt[p]{\frac{1}{L(L-1)}\sum_{l=1}^L\sum_{\substack{r=1\\r\neq l}}^L\left|Q\left(\hat{G}_l, \hat{G}_r\right) - Q\left(\tilde{G}_l, \tilde{G}_r\right)\right|^p},
where \hat{G}_l denotes the l-th band of the fused image and \tilde{G}_l the l-th band of the original LRMS.
D_s = \sqrt[q]{\frac{1}{L}\sum_{l=1}^L\left|Q\left(\hat{G}_l, P\right) - Q\left(\tilde{G}_l, \tilde{P}\right)\right|^q},
where P is the PAN image and \tilde{P} is the PAN degraded to the LRMS resolution.
\mathrm{QNR} is the product of the one’s complements of the spectral and spatial distortion indices, each raised to a real-valued exponent that weights the relevance of spectral and spatial distortion in the overall quality:
\mathrm{QNR} = \left(1 - D_\lambda\right)^\alpha\cdot\left(1 - D_s\right)^\beta, \alpha, \beta \in \mathbb{R}
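Combining the pieces, a sketch of D_\lambda, D_s, and QNR built on the qi() helper above; p = q = 1 and \alpha = \beta = 1 are common defaults, and \hat{G}/\tilde{G} map to `fused`/`lrms`, P/\tilde{P} to `pan`/`pan_lr` here.

```python
import numpy as np

def d_lambda(fused, lrms, p=1):
    """Spectral distortion: compare inter-band QI of the fused product vs. the original MS."""
    L = fused.shape[-1]
    s = 0.0
    for l in range(L):
        for r in range(L):
            if r == l:
                continue
            s += abs(qi(fused[..., l], fused[..., r]) - qi(lrms[..., l], lrms[..., r])) ** p
    return (s / (L * (L - 1))) ** (1 / p)

def d_s(fused, lrms, pan, pan_lr, q=1):
    """Spatial distortion: compare each band's QI with the PAN at both scales."""
    L = fused.shape[-1]
    s = sum(abs(qi(fused[..., l], pan) - qi(lrms[..., l], pan_lr)) ** q for l in range(L))
    return (s / L) ** (1 / q)

def qnr(d_lam, d_spa, alpha=1.0, beta=1.0):
    """QNR = (1 - D_lambda)^alpha * (1 - D_s)^beta."""
    return (1 - d_lam) ** alpha * (1 - d_spa) ** beta
```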